SQuaScheD -- Additional Material

SQuaScheD - Unsupervised Schema Discovery for Heterogeneous Data

This web site provides additional material for the article "Here is the Data. Where is its Schema?", submitted to the 24th International WWW Conference 2015 as submission #312.

You will find below the detailed hierarchies, ground truth class distributions and MDL evolution for hierarchies discovered by SQuaScheD on all datasets mentioned in the paper.

Detailed SQuaScheD Discovered Hierarchies

We present below interactive visualizations showing the most representative attributes and entities for each class of the hierarchies discovered by SQuaScheD, for each dataset.

The divergence between the SQuaScheD results and the ground truth is mainly due to two factors. First, some leaf classes in the ground truth can themselves be divided further into subclasses. For instance, election can be divided into state election and general election; Event can be divided into events in different locations, such as the US, Korea, China, etc. Second, there are meta attributes in the data that may mislead the discovery process. For example, some entities are associated with images and some are not; because an image has multiple attributes, such as image_size and url, it may force SQuaScheD to divide the entities into a class with images and a class without images.

Ground Truth Class Distribution

The figures below show, for each dataset, the distribution of the bottom-most ground-truth classes in the discovered class hierarchy.


ActivityEducationalInstitution

Ground Truth

spectral

ArchitecturalStructure

Ground Truth

spectral

Event

Ground Truth

spectral

Event_NaturalPlace_WrittenWork

Ground Truth

spectral

Infrastructure

Ground Truth

spectral

RouteOfTransportation

Ground Truth

spectral

Species

Ground Truth

spectral

Tunnel

Ground Truth

spectral
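
As a rough illustration of what such a distribution contains, the sketch below counts, for each discovered class, how many of its entities carry each ground-truth label. It is only a minimal sketch under assumptions: the function name, the dictionary-based inputs, and the toy entity and class identifiers are hypothetical and are not taken from the SQuaScheD implementation or datasets.

```python
from collections import Counter, defaultdict


def ground_truth_distribution(discovered, ground_truth):
    """Count ground-truth labels inside each discovered class.

    Both arguments are hypothetical inputs: dicts mapping an entity id
    to a discovered class id and to a ground-truth label, respectively.
    """
    distribution = defaultdict(Counter)
    for entity, discovered_class in discovered.items():
        distribution[discovered_class][ground_truth[entity]] += 1
    return dict(distribution)


# Toy data, invented purely for illustration.
discovered = {"e1": "c0", "e2": "c0", "e3": "c1", "e4": "c1"}
ground_truth = {
    "e1": "Election",
    "e2": "Election",
    "e3": "Election",
    "e4": "Convention",
}

dist = ground_truth_distribution(discovered, ground_truth)
for discovered_class, counts in sorted(dist.items()):
    print(discovered_class, dict(counts))
```

A discovered class whose counter holds a single label is pure with respect to the ground truth, while a counter spread over several labels indicates a class that mixes ground-truth classes.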

MDL Evolution in SQuaScheD

The figures below show the evolution of the MDL and of the class precision, recall, and F2 over the steps of the SQuaScheD process for all datasets.

ActivityEducationalInstitution

ArchitecturalStructure

Event_NaturalPlace_WrittenWork

Event

Infrastructure

RouteOfTransportation

Species

Tunnel
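
For readers unfamiliar with the F2 measure plotted in this section, the sketch below shows one common way to compute it. It assumes that F2 denotes the standard F-beta score with beta = 2, which weights recall more heavily than precision, and that class precision and recall are measured as the set overlap between one discovered class and one ground-truth class. The exact definitions used in the paper may differ; the function names and toy data are hypothetical.

```python
def class_precision_recall(discovered_members, truth_members):
    """Set-overlap precision and recall of a discovered class against a
    ground-truth class (an assumed definition, not the paper's)."""
    overlap = len(discovered_members & truth_members)
    precision = overlap / len(discovered_members) if discovered_members else 0.0
    recall = overlap / len(truth_members) if truth_members else 0.0
    return precision, recall


def f_beta(precision, recall, beta=2.0):
    """Standard F-beta score; beta = 2 gives F2."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    beta_sq = beta * beta
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Toy data, invented purely for illustration.
discovered_members = {"e1", "e2", "e3", "e4"}
truth_members = {"e1", "e2", "e5"}

precision, recall = class_precision_recall(discovered_members, truth_members)
f2 = f_beta(precision, recall, beta=2.0)
print(precision, recall, f2)
```

With beta = 2, a discovered class that covers most of a ground-truth class but also includes some extra entities scores higher than one that is pure but covers only a small part of it.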